Improving the performance of focused web crawlers
نویسندگان
چکیده
This work addresses issues related to the design and implementation of focused crawlers. Several variants of state-of-the-art crawlers relying on web page content and link information for estimating the relevance of web pages to a given topic are proposed. Particular emphasis is given to crawlers capable of learning not only the content of relevant pages (as classic crawlers do) but also paths leading to relevant pages. A novel learning crawler inspired by a previously proposed Hidden Markov Model (HMM) crawler is described as well. The crawlers have been implemented using the same baseline implementation (only the priority assignment function differs in each crawler) providing an unbiased evaluation framework for a comparative analysis of their performance. All crawlers achieve their maximum performance when a combination of web page content and (link) anchor text is used for assigning download priorities to web pages. Furthermore, the new HMM crawler improved the performance of the original HMM crawler and also outperforms classic focused crawlers in searching for specialized topics.
منابع مشابه
Topical Crawling for Business Intelligence
The Web provides us with a vast resource for business intelligence. However, the large size of the Web and its dynamic nature make the task of foraging appropriate information challenging. Generalpurpose search engines and business portals may be used to gather some basic intelligence. Topical crawlers, driven by richer contexts, can then leverage on the basic intelligence to facilitate in-dept...
متن کاملMining the web with hierarchical crawlers - a resource sharing based crawling approach
An important component of any web search engine is its crawler, which is also known as robot or spider. An efficient set of crawlers make any search engine more powerful, apart from its other measures of performance, such as its ranking algorithm, storage mechanism, indexing techniques, etc. In this paper, we have proposed an extended technique for crawling over the World Wide Web (WWW) on beha...
متن کاملEvaluation of a Graph-based Topical Crawler
Topical (or, focused) crawlers have become important tools in dealing with the massiveness and dynamic nature of the World Wide Web. Guided by a data mining component that monitors and analyzes the boundary of the set of crawled pages, a focused crawler selectively seeks out pages on a pre-defined topic. Recent research indicates that both the textual content of web pages and the structural inf...
متن کاملExpanding Reinforcement Learning Approaches for Efficient Crawling the Web
The amount of accessible information on World Wide Web is increasing rapidly, so that a general-purpose search engine cannot index everything on the Web. Focused crawlers have been proposed as a potential approach to overcome the coverage problem of search engines by limiting the domain of concentration of them. Focused crawling is a technique which is able to crawl particular topical portions ...
متن کاملA Novel Hybrid Focused Crawling Algorithm to Build Domain-Specific Collections
The Web, containing a large amount of useful information and resources, is expanding rapidly. Collecting domain-specific documents/information from the Web is one of the most important methods to build digital libraries for the scientific community. Focused Crawlers can selectively retrieve Web documents relevant to a specific domain to build collections for domain-specific search engines or di...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- Data Knowl. Eng.
دوره 68 شماره
صفحات -
تاریخ انتشار 2009